4. Visualization Techniques: Distributions#

4.1. General Guidelines for EDA#

It is difficult to say what the best process for Exploratory Data Analysis (EDA) is. In this chapter I present a number of analyses and possible visualizations. Many alternative recommendations exist; here is one of them:

  • Begin with a discussion of the “center” of the data, generally based on the mean.

  • Describe how the data are distributed.

  • Follow with a discussion of variability (and of skew, if appropriate).

  • End with a summary evaluation, which may have a subjective component (numbers must be interpreted; they don’t speak for themselves).

  • Use numbers in a description wisely – not too few, not too many.

However, each dataset is special; thus, EDA often remains an individualized process. Therefore, reproducibility is very important during EDA. Computational notebooks such as R Markdown (used here), Jupyter notebooks (used in our exercise), and Observable support the readability and understandability of your exploration process. This supports Knuth’s vision of literate programming.

4.2. Tools and Libraries for Data Exploration#

Besides GNU R, there are many tools and libraries that support the process of data exploration. In this section, I provide a selection.

Wrangler provides data-transformation scripts within a visual, direct-manipulation interface augmented by predictive models [Kandel et al., 2011]. Wrangler is an interactive system for creating data transformations. It uses semantic data types, such as geographic locations, dates, and classification codes, to support data validation and type conversion. Interactive histories support review, refinement, and annotation of transformation scripts. The researchers provide a web app, and there is also a commercial product.

A similar approach is realized by OpenRefine. This tool was formerly developed by Google but is now maintained by the open-source community. It allows you to clean data, transform it from one format into another, or extend your data with additional data from an API.

Besides Data Wrangler, I would like to mention the tool Voyager [Wongsuphasawat et al., 2015]. The Voyager system is specifically suitable for exploratory visual analysis. You can again test it via a web app, and the source code is available on GitHub. And for the sake of completeness, there is also the commercial software Tableau, which can be used free of charge in an educational context.

Both Data Wrangler and Voyager use a formal language: the Vega-Lite visualization grammar. Vega-Lite is a high-level grammar of interactive graphics that provides a concise, declarative JSON syntax for creating diagrams for data analysis and presentation. Vega-Lite specifications describe visualizations as “encoding mappings” from data to properties of graphical marks (e.g., points or bars).

The Vega-Lite compiler automatically produces visualization components including axes, legends, and scales. Vega-Lite supports both data transformations (e.g., aggregation, binning, filtering, sorting) and visual transformations (e.g., stacking and faceting). Moreover, Vega-Lite specifications can be composed into layered and multi-view displays, and made interactive with selections as you can see in the example.
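To make the grammar concrete, here is a minimal sketch of a Vega-Lite specification built as a plain Python dictionary and serialized with the standard library (the field names `a` and `b` are made up for illustration; normally you would hand such a spec to the Vega-Lite compiler or generate it via a wrapper such as Altair):

```python
import json

# A minimal Vega-Lite bar-chart specification: inline data values, a mark
# type, and encoding mappings from data fields to visual channels.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": [{"a": "A", "b": 28}, {"a": "B", "b": 55}]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "a", "type": "nominal"},
        "y": {"field": "b", "type": "quantitative"},
    },
}

# The whole specification is ordinary JSON
print(json.dumps(spec, indent=2))
```

Note how the specification only *declares* the mapping from data to marks; axes, scales, and legends are produced automatically by the compiler, as described next.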

Vega-Lite is also used by another interesting library: Altair, a simple API built on top of the Vega-Lite JSON specification. It is a declarative statistical visualization library for Python. Source code and comprehensive documentation are available on GitHub.

4.3. Understanding the Data Structure#

In this section, we focus on understanding the structure of our data by employing Exploratory Data Analysis (EDA). EDA is an approach to analyzing data sets to summarize their main characteristics by using data visualizations. In 1977, John Tukey [Tukey, 1977] introduced EDA with his seminal book on the topic. He was an extraordinary scientist who had a profound impact on statistics and computer science[2]. Much of what we cover in EDA today is based on his work. Part of EDA is the so-called initial data analysis (IDA). IDA focuses on identifying data inconsistencies (e.g., missing values) and describing the data properties; thus, EDA encompasses IDA.

Exploratory Data Analysis allows data analysts to achieve a richer qualitative understanding by “looking at data to see what it seems to say”. EDA should be understood as an iterative process that supports the following:

  • the search for answers by visualizing, transforming, and modeling your data,

  • the generation of hypotheses about what might be happening in a data set, and

  • the refining of your analysis goals or the generation of additional goals.

This step should not be underestimated, since data analysts spend much of their time (sometimes 80% or more) cleaning and formatting data to make it suitable for analysis before actually carrying out the analysis.

EDA is based on three principles: (1) continuous openness and re-expression, (2) initial skepticism, and (3) exploratory versus confirmatory. Rather than immediately imposing a model on the data that may obscure important details, EDA analysts try to find patterns in the data and describe them with simple summary statistics (descriptive statistics). It may take several iterations for the analyst to reach a satisfactory summary or “smoothing” of the data[3]. Re-expressions or transformations of the data are essential for smoothing because they help the analyst identify new patterns. Because EDA analysts assume that there is no uniquely correct numerical summary of a data set, they are very skeptical of initial numerical summaries. Numerical summaries and smoothings are constantly tested against the raw data to ensure that they adequately represent the data. To identify patterns and to look for data points that do not fit the smooth part (outliers), EDA analysts rely heavily on visualization. By supporting data exploration, EDA helps researchers generate hypotheses. These hypotheses can later be tested with formal confirmatory procedures using inferential statistics.

In summary, your goal during EDA is to develop an understanding of your data. The easiest way to accomplish this is to use questions to guide your investigation. When you ask a question, it focuses your attention on a particular part of your data set and helps you decide which graphs, models, or transformations to apply.

EDA is a creative process [Wickham and Grolemund, 2017]; thus, the key to asking meaningful questions is to generate a large number of questions. Of course, it is very challenging to generate these questions at the beginning because you are not yet familiar with the dataset. On the other hand, each new question you ask will expose you to a new aspect of your data and increase your chance of discovery. You can quickly drill down into the most interesting parts of your data – and develop a thought-provoking set of questions – if you follow each question with a new question based on your findings. This challenge was already formulated by Tukey:

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise. — John Tukey (The future of data analysis. Annals of Mathematical Statistics 33 (1), (1962), page 13)

There is no rule about what questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries in your data. You can phrase these questions loosely as (1) What kind of variation occurs within my variables? and (2) What kind of co-variation occurs between my variables?

4.3.1. Using Python for Data Exploration#

In the following, we address these two questions based on an example by Héctor Corrada Bravo from the EDA chapter of his course “Introduction to Data Science” at the Center for Bioinformatics and Computational Biology, University of Maryland, but using Python instead of R.

We will use Python in combination with the libraries Pandas and Altair. However, you can follow these steps with any programming language at hand. I would like to provide you with a methodological understanding of how to explore data rather than an introduction to Python.

Pandas is a library for data analysis and is therefore very well suited for evaluating and preparing your data before visualizing them. Here you can find a short introduction to Pandas.

Vega-Altair is a declarative statistical visualization library for Python. With Vega-Altair you can transform your data into various kinds of visualizations. Here you can get a short overview of Vega-Altair.

4.3.2. Visualizing Data#

In the following, we use the on-time data for all flights that departed NYC (i.e., JFK, LGA, or EWR) in 2013. The Bureau of Transportation Statistics has released these data, and they are available in Python via the nycflights13 package. Let’s get an overview of this dataset.

from nycflights13 import flights

# Show the internal structure of the data frame (= flights)
flights.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute          336776 non-null  int64  
 18  time_hour       336776 non-null  object 
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB

The output first shows the type and dimensions of the data frame[4] and then lists each column (attribute) together with its non-null count and datatype.

Understanding the structure of the dataset is quite useful, since it allows you to get an overview on the available data types. A good understanding of the different data types is an important prerequisite for EDA, because you can use certain statistical measurements only for certain data types. You also need to know which data type you are dealing with in order to choose the right visualization method. Think of data types as a way to categorize different types of variables. We already discussed different types of variables in Section 2.3.1.3.

For setting up the pipeline it makes sense to work with a subset only; thus, we draw a random sample of 10,000 observations from the available data. Furthermore, I decided to include only complete observations. However, this decision should not be made carelessly.

from nycflights13 import flights
import pandas as pd

# Select a sample from the whole data set
flights = pd.DataFrame(flights)
# take a sample of 10,000 entries; copy() ensures that changes to flights_sample don't affect flights, otherwise you will get a SettingWithCopyWarning
flights_sample = flights.sample(n=10000, replace=True, random_state=1).copy() 

# dimensions of the data set
flights_sample.shape
(10000, 19)
# remove all observations that are not complete (missing a value)
# axis=0: drop rows; how='any': drop if any value is missing; inplace=True: modify the DataFrame instead of creating a new one
flights_sample.dropna(axis=0, how='any', inplace=True) 

# dimensions of the data set
flights_sample.shape
(9732, 19)
# show first 10 rows
flights.head(10)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 2013 1 1 517.0 515 2.0 830.0 819 11.0 UA 1545 N14228 EWR IAH 227.0 1400 5 15 2013-01-01T10:00:00Z
1 2013 1 1 533.0 529 4.0 850.0 830 20.0 UA 1714 N24211 LGA IAH 227.0 1416 5 29 2013-01-01T10:00:00Z
2 2013 1 1 542.0 540 2.0 923.0 850 33.0 AA 1141 N619AA JFK MIA 160.0 1089 5 40 2013-01-01T10:00:00Z
3 2013 1 1 544.0 545 -1.0 1004.0 1022 -18.0 B6 725 N804JB JFK BQN 183.0 1576 5 45 2013-01-01T10:00:00Z
4 2013 1 1 554.0 600 -6.0 812.0 837 -25.0 DL 461 N668DN LGA ATL 116.0 762 6 0 2013-01-01T11:00:00Z
5 2013 1 1 554.0 558 -4.0 740.0 728 12.0 UA 1696 N39463 EWR ORD 150.0 719 5 58 2013-01-01T10:00:00Z
6 2013 1 1 555.0 600 -5.0 913.0 854 19.0 B6 507 N516JB EWR FLL 158.0 1065 6 0 2013-01-01T11:00:00Z
7 2013 1 1 557.0 600 -3.0 709.0 723 -14.0 EV 5708 N829AS LGA IAD 53.0 229 6 0 2013-01-01T11:00:00Z
8 2013 1 1 557.0 600 -3.0 838.0 846 -8.0 B6 79 N593JB JFK MCO 140.0 944 6 0 2013-01-01T11:00:00Z
9 2013 1 1 558.0 600 -2.0 753.0 745 8.0 AA 301 N3ALAA LGA ORD 138.0 733 6 0 2013-01-01T11:00:00Z

4.3.3. Scatterplot#

The next step is to get a first overview of the data, and for this we can already use a visualization. Here I use a simple scatterplot.

import altair as alt

alt.data_transformers.disable_max_rows() # Disables the max rows restriction on handling data
fly_viz1 = flights_sample
fly_viz1 = fly_viz1.reset_index() # reset_index copies the data frame index into the new index column

# Visualize Data 1 - Scatterplot
alt.Chart(fly_viz1).mark_circle(size=40).encode( 
    x=alt.X('index', title='Flight ID'),
    y=alt.Y('dep_delay', title='Departure delay (in min)'),
    tooltip=['flight', 'year', 'dep_time', 'dep_delay'] # when you hover over the points, these data will be shown
).properties( # you can set the size of the chart with properties
    width=550,
    height=400
    ).interactive()

Fig. 4.1 Scatterplot of delay times.#

This is not very informative because the plot is not structured. However, let us reflect on the visualization for a moment. A scatterplot encodes two quantitative variables using the vertical and horizontal spatial position channels, and the mark type is necessarily a point. Scatterplots are highly effective for judging the correlation between two attributes. They are often augmented with color coding to show an additional attribute. We will discuss these characteristics in detail again later.

Table: Characteristics of a scatterplot [Munzner, 2014]

  • Idiom: Scatterplot

  • What: Data – Table: two quantitative value attributes.

  • How: Encode – Express values with horizontal and vertical spatial position and point marks.

  • Why: Task – Find trends, outliers, distribution, correlation; locate clusters.

  • Scale – Items: hundreds.

Let’s sort the values and change the graphical representation to make it easier to see.

# Visualize Data - Scatterplot with ordered values
# with copy() it is specified, that changes to flights_sample doesn't effect flights, else you will get a SettingWithCopyWarning
fly_viz2 = flights_sample.copy() 
# 'sort_values' sorts a variable, here dep_delay, in ascending order
# ignore_index=True discards the previous index so that the data frame index is reset
fly_viz2 = fly_viz2.sort_values(by=['dep_delay'], ignore_index=True) 
# limit data to dep_delay<800
fly_viz2 = fly_viz2[fly_viz2.dep_delay<800]

# create new column 'index' with row numbers from column data
fly_viz2 = fly_viz2.reset_index()

alt.Chart(fly_viz2).mark_circle(size=40).encode(  
    x=alt.X('index', title='Ordered Flight ID'),
    y=alt.Y('dep_delay', title='Departure delay (in min)'),
    tooltip=['flight', 'year', 'dep_time', 'dep_delay']
).properties( 
    width=550,
    height=400
    ).interactive()

Fig. 4.2 Scatterplot of ordered delay times.#

What do you think of this chart? What can you say about flight delay times now? In the following, we focus on the delays only, since many flights seem to be on time.

# Keep only flights with a dep_delay of at most 800
flights_sample = flights_sample[flights_sample.dep_delay<=800]

# dimensions of the data set
flights_sample.shape
(9732, 19)

4.3.4. Histogram#

Let’s now create a graphical summary of these variables. Let’s start with a histogram. It divides the range of the dep_delay attribute into equal-sized bins and then plots the number of observations within each bin. What additional information does this new visualization give us about this variable?

The histogram idiom shows the distribution of the values within an attribute. A classic example from Munzner is a histogram of the weight distribution of all cats in a neighborhood, binned into 5-pound ranges.

The visual encoding of a histogram is very similar to that of a bar chart, using a line mark. One difference is that histograms are sometimes displayed with no space between the bars to visually imply continuity, while bar charts conversely have spaces between the bars to imply discretization. Despite this visual similarity, histograms are very different from bar charts: they do not show the original data but an aggregation of it.

The number of bins in the histogram can be chosen independently of the number of elements in the data set. The choice of bin size is crucial and tricky: a histogram can look very different depending on the discretization chosen. One possible solution to the problem is to calculate the number of bins based on the features of the data set; another is to provide controls for the user to interactively change the number of bins and see how the histogram changes.
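One such data-driven heuristic is Sturges' rule, k = ⌈log₂ n⌉ + 1. The following standard-library sketch (with made-up delay values) derives the bin count this way and then counts values into equal-width bins, which is exactly the aggregation a histogram displays:

```python
import math

def sturges_bins(n: int) -> int:
    """Number of histogram bins suggested by Sturges' rule."""
    return math.ceil(math.log2(n)) + 1

def histogram(values, k):
    """Count values into k equal-width bins; returns per-bin counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # clamp the maximum value into the last bin
        i = min(int((v - lo) / width), k - 1)
        counts[i] += 1
    return counts

delays = [-6, -1, 2, 2, 4, 11, 20, 33]   # hypothetical dep_delay values
k = sturges_bins(len(delays))            # ceil(log2(8)) + 1 = 4
counts = histogram(delays, k)
print(k, counts)
```

Plotting libraries apply the same idea internally; the interactive alternative is to expose `k` as a user control and redraw on change.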

Table: Characteristics of a histogram [Munzner, 2014]

  • Idiom: Histogram

  • What: Data – Table: one quantitative value attribute.

  • What: Derived – Derived table: one derived ordered key attribute (bin), one derived quantitative value attribute (item count per bin).

  • How: Encode – Rectilinear layout. Line mark with aligned position to express the derived value attribute; position: key attribute.

fly_viz3 = flights_sample[flights_sample.dep_delay<=60]

alt.Chart(fly_viz3).mark_bar().encode(
  x=alt.X('dep_delay', title='Departure delay (in min)'),
  y=alt.Y('count()', title='Number of Flights')
).properties( 
    width=550,
    height=400
    )

Fig. 4.3 Histogram of delay times.#

In the standard function, the number of bins is 30, but of course you can change it easily. The choice of bin width significantly affects the resulting plot: smaller bin widths can make the plot cluttered, whereas larger bin widths may obscure nuances in the data.

s1 = alt.Chart(flights_sample).mark_bar().encode(
  x=alt.X('dep_delay', title='Departure delay (in min)', bin=alt.Bin(extent=[-20, 60], step=1)),
  y=alt.Y('count()', title='Number of Flights')
).properties( 
    width=200,
    height=150
    )

s5 = alt.Chart(flights_sample).mark_bar().encode(
  x=alt.X('dep_delay', title='Departure delay (in min)', bin=alt.Bin(extent=[-20, 60], step=5)),
  y=alt.Y('count()', title='Number of Flights')
).properties( 
    width=200,
    height=150
    )



s10 = alt.Chart(flights_sample).mark_bar().encode(
  x=alt.X('dep_delay', title='Departure delay (in min)', bin=alt.Bin(extent=[-20, 60], step=10)),
  y=alt.Y('count()', title='Number of Flights')
).properties( 
    width=200,
    height=150
    )

s15 = alt.Chart(flights_sample).mark_bar().encode(
  x=alt.X('dep_delay', title='Departure delay (in min)', bin=alt.Bin(extent=[-20, 60], step=15)),
  y=alt.Y('count()', title='Number of Flights')
).properties( 
    width=200,
    height=150
    )

alt.vconcat((s1 | s5), (s10 | s15))

Fig. 4.4 Histograms of delay times with different bin widths.#

4.3.5. Density Plot#

A density plot is a smoothed, continuous version of a histogram that visualizes the underlying probability distribution of the data as a continuous curve (an excellent introduction to the usefulness of this method is given by Claus Wilke; see https://clauswilke.com/dataviz/histograms-density-plots.html). The peaks of a density plot show where values are concentrated over the interval. The most common form of estimation is known as kernel density estimation. In this method, a continuous curve (the kernel) is drawn at every individual data point, and all of these curves are then added together to make a single smooth density estimate. The kernel most often used is a Gaussian (which produces a Gaussian bell curve at each data point).

Just as with histograms, the exact visual appearance of a density plot depends on the choice of kernel and bandwidth.
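The summing-of-kernels idea can be sketched in a few lines of standard-library Python (the bandwidth `h` is chosen by hand here; real libraries estimate it from the data):

```python
import math

def gaussian_kde(x, data, h):
    """Kernel density estimate at point x: one Gaussian bell per data
    point, averaged, with bandwidth h controlling the smoothing."""
    norm = 1.0 / (len(data) * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data)

delays = [-6, -1, 2, 2, 4, 11, 20, 33]   # hypothetical dep_delay values

# Evaluate the density on a grid; a proper density integrates to ~1
grid = [x * 0.5 for x in range(-80, 161)]        # -40 .. 80, step 0.5
density = [gaussian_kde(x, delays, h=3.0) for x in grid]
area = sum(d * 0.5 for d in density)
print(f"area under the estimated density: {area:.3f}")
```

Increasing `h` smooths peaks away; decreasing it makes the curve spiky, mirroring the bin-width trade-off of histograms.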

# Visualize Data - Density Plot
alt.Chart(fly_viz3).transform_density(
  'dep_delay',
  as_=['Departure delay (in min)', 'density'],
).mark_area().encode(
  x="Departure delay (in min):Q",
  y='density:Q'
).properties( 
    width=550,
    height=400
    )

Fig. 4.5 Density Plot of delay times.#

4.3.6. Boxplot#

An alternative for displaying the distribution of a continuous variable, optionally broken down by a categorical variable, is the boxplot. The boxplot is an idiom presenting summary statistics for the distribution of a quantitative attribute, using five derived values (Fig. 4.6).

_images/boxplot.png

Fig. 4.6 Properties of a Boxplot#

The box extends from the 25th percentile (lower quartile, Q1) of the distribution to the 75th percentile (upper quartile, Q3); this distance is called the interquartile range (IQR). In the center of the box is a line indicating the median, or 50th percentile, of the distribution. These three lines give you an idea of the spread of the distribution and whether it is symmetrical about the median or skewed to one side. Furthermore, a line (or whisker) extends from each end of the box to the furthest non-outlier point (within Q1 − 1.5 × IQR and Q3 + 1.5 × IQR), indicating the range. Observations that fall more than 1.5 times the IQR beyond either edge of the box are unusual, so they are plotted individually as outliers.
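These five derived values and the outlier fences can be computed directly; here is a standard-library sketch on made-up delay values (`statistics.quantiles` with `method="inclusive"` gives the usual linear-interpolation quartiles):

```python
import statistics

# Hypothetical dep_delay values, including one extreme observation
delays = [-6, -1, 2, 2, 4, 11, 20, 33, 180]

q1, median, q3 = statistics.quantiles(delays, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers reach the furthest points still inside the 1.5 * IQR fences;
# anything beyond the fences is drawn as an individual outlier point
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [d for d in delays if d < lo_fence or d > hi_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"outliers beyond the fences: {outliers}")
```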

Boxplots are useful when we want to visualize many distributions at once and/or if we are primarily interested in overall shifts among the distributions.

Table: Characteristics of a boxplot [Munzner, 2014]

  • Idiom: Boxplot

  • What: Data – Table: many quantitative value attributes.

  • What: Derived – Five quantitative attributes for each original attribute, representing its distribution.

  • Why: Task – Characterize distribution; find outliers, extremes, averages; identify skew.

  • How: Encode – One glyph per original attribute expressing the derived attribute values using vertical spatial position, with a 1D list alignment of the glyphs, separated by horizontal spatial position.

  • How: Reduce – Item aggregation.

  • Scale – Items: unlimited. Attributes: dozens.

# Visualize Data - Box Plot 
fly_viz2['x'] = ""
alt.Chart(fly_viz2, width=200).mark_boxplot().encode( # control the size of the chart with height/width
  alt.X('x'),
  alt.Y('dep_delay:Q').scale(zero=False)
).properties( 
    width=550,
    height=400
    )

Fig. 4.7 Box Plot of delay times.#

import numpy as np # is needed to calculate the logarithm

# find the minimum of dep_delay
min_delay = fly_viz2['dep_delay'].min() 
# subtract the min_delay from dep_delay in each row
fly_viz4 = fly_viz2.copy()
fly_viz4["dep_delay_min"] = fly_viz2.dep_delay - min_delay
fly_viz4 = fly_viz4[fly_viz4.dep_delay_min!=0] # to avoid log(0) in the next step

# create a new column that contains the logarithm of the previously subtracted dep_delay
fly_viz4['log_dep_delay'] = np.log(fly_viz4['dep_delay_min'])

# Visualize Data - Box Plot with log scale
alt.Chart(fly_viz4, width=200).mark_boxplot().encode( 
  alt.Y('log_dep_delay:Q').scale(zero=False)
).properties( 
    width=550,
    height=400
    )

Fig. 4.8 Box Plot of delay times (log scale).#

4.3.7. Compare Distributions#

Now we can start looking at relationships between pairs of attributes, that is, how each of the distributional properties we care about (central tendency, spread, and skew) of one attribute’s values changes with the value of another attribute. Suppose we want to see the relationship between the departure delay time (a numeric variable) and the airport of origin (a categorical variable).
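Before plotting, such a relationship can be probed numerically by grouping the numeric variable by the categorical one; a standard-library sketch with hypothetical (origin, delay) pairs:

```python
import statistics
from collections import defaultdict

# Hypothetical (origin airport, departure delay in minutes) pairs
records = [("JFK", 2), ("JFK", 33), ("LGA", -6), ("LGA", 4),
           ("EWR", 11), ("EWR", -1), ("EWR", 20)]

# Group the numeric variable by the categorical one
groups = defaultdict(list)
for origin, delay in records:
    groups[origin].append(delay)

# Summarize each group's center and spread
for origin, delays in sorted(groups.items()):
    print(origin, "median:", statistics.median(delays),
          "range:", (min(delays), max(delays)))
```

The grouped boxplot below shows the same per-group summaries visually, one glyph per category.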

# Visualize Data - Box Plot in groups
alt.Chart(fly_viz4, width=200).mark_boxplot().encode(
  alt.X('origin:N'),
  alt.Y('log_dep_delay:Q').scale(zero=False)
).properties( 
    width=550,
    height=400
    )

Fig. 4.9 Box Plot of delay times.#

4.3.8. Visualizing Multiple Distributions at Once#

# limit arrival delay time to -60 - 120
fly_viz3 = fly_viz3[fly_viz3.arr_delay<120]
fly_viz3 = fly_viz3[fly_viz3.arr_delay>-60]

# rename used carriers
# Legend of carrier names: https://nycflights13.tidyverse.org/reference/airlines.html
airlines_s = ['UA', 'B6', 'EV', 'DL', 'AA', 'AS']
airline_list = ['United Air Lines Inc.', 'JetBlue Airways', 'ExpressJet Airlines Inc.', 'Delta Air Lines Inc.', 'American Airlines Inc.', 'Alaska Airlines Inc.']
for i in range(6):
    fly_viz3 = fly_viz3.replace(airlines_s[i], airline_list[i])

# Set the charts and add them in one chart
d = {}
for i in range(5):
    key = str("a"+str(i))
    d[key] = alt.Chart(fly_viz3).mark_bar(opacity=0.3).encode(
        x=alt.X('arr_delay', title='Arrival delay (in min)', bin=alt.Bin(step=5)),
        y=alt.Y('count()', title='Number of Flights'),
        color=alt.Color('carrier:N', title='Airlines')
    ).transform_filter(
        alt.FieldEqualPredicate(field='carrier', equal=airline_list[i])
    ).properties( 
        width=550,
        height=400
        )

d['a0'] + d['a1'] + d['a2'] + d['a3'] + d['a4']

Fig. 4.10 Arrival delays at NYC airport: Overlapping Histogram.#

import matplotlib.pyplot as plt
# Reusable Matplotlib/Seaborn plot formatting
def plot_formatting():
    # Plot formatting
    plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0, frameon=False)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.show()
#fig, ax = plt.subplots()
fig, ax = plt.subplots(figsize=(7,5), dpi=150)

# Assign colors for each airline and the names
colors = ['#E69F00', '#56B4E9', '#F0E442', '#009E73', '#D55E00']
names = ['United Air Lines Inc.', 'JetBlue Airways', 'ExpressJet Airlines Inc.',
         'Delta Air Lines Inc.', 'American Airlines Inc.']

# Make a separate list for each airline
d = {}
for i in range(5):
    key = str("x"+str(i))
    d[key] = list(fly_viz3[fly_viz3['carrier'] == names[i]]['arr_delay'])
     
# Make the histogram using a list of lists
# Normalize the flights and assign colors and names
plt.hist([d['x0'], d['x1'], d['x2'], d['x3'], d['x4']], bins = int(180/15),
         color = colors, label=names, density=True)

plt.xlabel('Delay (min)')
plt.ylabel('Normalized Flights')
plt.title('Side-by-Side Histogram with Multiple Airlines')
plt.xlim(-75,125)
plt.ylim(0,0.025)
plot_formatting()
_images/eed1d83d8e3cf3a703debf808aa8b7fe99145b877da26e2ac582e3b9bf7fa6f6.png

Fig. 4.11 Arrival delays at NYC airport: Side-by-Side Histogram.#

# Make the histogram using a list of lists
# Normalize the flights and assign colors and names
plt.hist([d['x0'], d['x1'], d['x2'], d['x3'], d['x4']], bins = int(180/15),
         color = colors, label=names, density=True, stacked=True)

# Plot formatting
plt.xlabel('Delay (min)')
plt.ylabel('Normalized Flights')
plt.xlim(-75,125)
plt.ylim(0,0.025)
plot_formatting()
_images/cbae4c20e7098b72dd6c581e5c25344c390703e2e67dbceb9bedf7b9f2ad3c91.png

Fig. 4.12 Arrival delays at NYC airport: Stacked Histogram.#

# transform_density gives a density plot
alt.Chart(fly_viz3).transform_density(
  'arr_delay',
  as_=['Arrival delay (in min)', 'Density'],
  groupby=['carrier']
# you can choose how you want to visualize the density plot (here line)
).mark_line(  
).encode(
  x="Arrival delay (in min):Q",
  y='Density:Q',
  color=alt.Color('carrier:N', title='Airlines')
).transform_filter(
    alt.FieldOneOfPredicate(field='carrier', oneOf=['United Air Lines Inc.', 'JetBlue Airways', 'ExpressJet Airlines Inc.', 'Delta Air Lines Inc.', 'American Airlines Inc.'])
).configure_range(
    category=alt.RangeScheme(colors)
).properties( 
    width=550,
    height=400
    )

Fig. 4.13 Arrival delays at NYC airport: Density Plot with Multiple Airlines.#

fly_viz3 = fly_viz3[fly_viz3.arr_delay<150]
fly_viz3 = fly_viz3[fly_viz3.arr_delay>-70]

fly_viz3 = fly_viz3.replace('AS', 'Alaska Airlines Inc.')

al1 = alt.Chart(fly_viz3).transform_density(
  'arr_delay',
  as_=['Arrival delay (in min)', 'density'],
  groupby=['carrier']
).mark_area(opacity=0.4  
).encode(
  x='Arrival delay (in min):Q',
  y='density:Q',
  color=alt.Color('carrier:N', title='Airlines')
).transform_filter(
    alt.FieldEqualPredicate(field='carrier', equal='United Air Lines Inc.')
).properties( 
    width=550,
    height=400
    )


al2 = alt.Chart(fly_viz3).transform_density(
  'arr_delay',
  as_=['Arrival delay (in min)', 'density'],
  groupby=['carrier']
).mark_area(opacity=0.4  
).encode(
  x='Arrival delay (in min):Q',
  y='density:Q',
  color=alt.Color('carrier:N', title='Airlines')
).transform_filter(
    alt.FieldEqualPredicate(field='carrier', equal='Alaska Airlines Inc.')
).properties( 
    width=550,
    height=400
    )

al1 + al2

Fig. 4.14 Arrival delays at NYC airport: Shaded Density Plot.#

import seaborn as sns
# Subset to Alaska Airlines
subset = fly_viz3[fly_viz3['carrier'] == 'Alaska Airlines Inc.']

# Density Plot with Rug Plot
sns.displot(subset['arr_delay'], 
    rug = True,
    kind="kde",
    height=4.5,
    aspect=1.5,
    color = 'darkblue', 
    rug_kws={'color': 'black'})

# Plot formatting
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.style.use('seaborn-v0_8-whitegrid')
plt.xlim((-100,100))
plt.ylim((0,0.020))
plt.show()
_images/f85c66d02fc60ba15073f739feffcf42b7008daa03a6f4d6593087d26e1703e1.png

Fig. 4.15 Arrival delay of Alaska Airlines Inc. at NYC airport: Rug Plot.#

# ECDF Plot
sns.ecdfplot(data=subset, x=subset['arr_delay'])

# Plot formatting
plt.xlabel('Delay (min)')
plt.xlim(-60,80)
plt.show()
_images/bafd30d053d3e658de729a4de5d4e5154eb7c30140444127ba0e163e9cd7685a.png

Fig. 4.16 Arrival Delays of Alaska Airlines: Empirical Cumulative Distribution Functions.#

flights_s = fly_viz1.copy()
flights_s.dropna(axis = 0, how = 'any', inplace = True) 

# keep only delayed flights (arr_delay > 0)
flights_s = flights_s[flights_s.arr_delay>0]

fig, ax = plt.subplots(1,2, figsize=(8,5), dpi=120)

# ECDF plot (left) and density plot (right)
sns.ecdfplot(data=flights_s, x=flights_s['arr_delay'], ax=ax[0])

sns.kdeplot(flights_s['arr_delay'], ax=ax[1])

# Plot formatting
plt.xlim((1, 1500))
plt.xlabel('Arrival Delay (min)')
plt.ylim(0,0.02)
plt.show()
_images/b46f752a1074ec3db3140b07925eb1354bbfe91dc11bcaa1e705bcab27021721.png

Fig. 4.17 Arrival Delays: Highly Skewed Distributions#

# Define 2 columns
fig, ax = plt.subplots(1,2, figsize=(8,5), dpi=120)

# ECDF plot (left) and density plot (right) of arrival delays on a log scale
sns.ecdfplot(data=flights_s, x=flights_s['arr_delay'], ax=ax[0], log_scale=10)

sns.kdeplot(flights_s['arr_delay'], ax=ax[1], log_scale=10)

# Plot formatting
plt.xlabel('Arrival Delay (min)')
plt.xlim(1, 1500)
plt.ylim(0,0.7)
plt.show()
_images/8d18b1c31834fc53fa2e5685103d5a0f8b5291affbf4c89390d98a3af562fc97.png

Fig. 4.18 Distribution of the logarithm of Arrival Delays.#

# limit arrival delay time to -60 - 120
fly_viz1 = fly_viz1[fly_viz1.arr_delay<120]
fly_viz1 = fly_viz1[fly_viz1.arr_delay>-60]

options = ['UA', 'B6', 'EV', 'DL', 'AA']
fly_filtered = fly_viz1[fly_viz1['carrier'].isin(options)]
sns.boxplot(data=fly_filtered, x='carrier', y='arr_delay', whis=(0, 100))

plt.xlabel('Airlines')
plt.ylabel('Arrival Delay (min)')
plt.ylim((-75,150))
plt.show()
_images/4ca1c7a1d63ef01cf7550e2ed1a4ecdbf4a30b5a2bba85dd3f83b7b38cdb9c18.png

Fig. 4.19 Arrival Delays: Box Plot.#

fly_filtered = fly_viz1[fly_viz1['carrier'].isin(options)]
sns.violinplot(data=fly_filtered, x='carrier', y='arr_delay')

plt.xlabel('Airlines')
plt.ylabel('Arrival Delay (min)')
plt.ylim(-75,150)
plt.show()
_images/6588bd1a3db9f4e270391879b6b7652ad4ba11185bfbdec065b973fcb19a7463.png

Fig. 4.20 Arrival Delays: Violin Plot.#